Refactor Ray Cluster/AppWrapper creation #650
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
e5e8b9b to
f25fddc
Compare
Fiona-Waters
left a comment
Magnum opus for sure @Bobbins228 ! Will continue reviewing but some nitpicks and a question so far.
cfc037f to
dab3020
Compare
KPostOffice
left a comment
I'll give this another pass tomorrow. Good work!
Are these auto-generated or have you created them by hand?
Can you give more info about how they were created?
My main concern is how we will keep these up to date as the Ray API spec evolves. I think this is definitely the correct approach; I just want to raise this concern now.
I wanted to auto-generate these, but from what I can see in the KubeRay repo there is no valid OpenAPI generator file (for reference, they should look like this: pet-store.yaml).
So I based the model files on how they would be auto-generated, going by what I can see in the Python K8s API and the specs from raycluster_types.go.
I am not a fan of how I generated these files, but given that the alternative is to re-create the base template within build_ray_cluster.py, I am not sure how we should proceed.
src/codeflare_sdk/cluster/cluster.py
Outdated
Can we turn this into something like:
cluster = Cluster.cluster_from_k8
I'm not a huge fan of the is_retrieved_cluster flag leaking out into public APIs. It feels very awkward to me.
Can you elaborate a bit more on this?
That bool exists to prevent a Ray Cluster from being built in build_ray_cluster.py. Otherwise, when we create the limited ClusterConfiguration, the Cluster object is created but you would get duplicated print statements for "Written to: {output_file_name}": first from creating the limited cluster, then from creating a file from the retrieved cluster.
You can create a class function which has different logic which also returns a cluster. Something like:
cluster = Cluster.new_cluster_skip_create(cluster_config)
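A minimal sketch of that idea, assuming a simplified stand-in for the SDK classes. The `ClusterConfiguration` fields, the `_build_resource` helper, and the optional `resource_yaml` parameter are hypothetical and exist only to illustrate the alternate-constructor pattern; they are not the actual codeflare-sdk implementation.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ClusterConfiguration:
    # Hypothetical stand-in for the SDK's configuration class.
    name: str
    namespace: str = "default"


class Cluster:
    def __init__(self, config: ClusterConfiguration):
        self.config = config
        # Normal path: build the resource from the configuration.
        self.resource_yaml = self._build_resource(config)

    @staticmethod
    def _build_resource(config: ClusterConfiguration) -> dict:
        # Placeholder for the real build step (assumption for this sketch).
        return {
            "kind": "RayCluster",
            "metadata": {"name": config.name, "namespace": config.namespace},
        }

    @classmethod
    def new_cluster_skip_create(
        cls, config: ClusterConfiguration, resource_yaml: Optional[dict] = None
    ) -> "Cluster":
        # Bypass __init__ so the build step never runs; the caller supplies
        # the resource that was already retrieved from the API server.
        cluster = cls.__new__(cls)
        cluster.config = config
        cluster.resource_yaml = resource_yaml
        return cluster
```

With this shape, the public `__init__` signature needs no extra flag, and only the retrieval code path calls the class method.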
You are a hero, that is a great idea!
@KPostOffice WDYT of the new changes for get_cluster I have added?
When trying to initialise the Cluster object I was getting exceptions for not providing a ClusterConfiguration. I tried to think of a way around this and opted to set the ClusterConfiguration to None when using get_cluster, which allows me to set the config and resource_yaml after initialisation.
In turn, I added a warning for when a user specifies ClusterConfiguration as None. I feel like this is still a bit crude and would welcome any further suggestions.
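For readers following along, the approach described above might look roughly like the sketch below. Everything here is an assumption made for illustration: the dict-based config, the warning text, and the use of `warnings.catch_warnings` to silence the warning on the internal call are not taken from the PR.

```python
import warnings
from typing import Optional


class Cluster:
    def __init__(self, config: Optional[dict] = None):
        self.config = config
        self.resource_yaml = None
        if config is None:
            # Warn users who pass None directly; nothing is built.
            warnings.warn(
                "No ClusterConfiguration provided; the cluster is not built.",
                UserWarning,
                stacklevel=2,
            )
            return
        # Placeholder for the real build step (assumption for this sketch).
        self.resource_yaml = {"metadata": {"name": config["name"]}}


def get_cluster(cluster_name: str, namespace: str) -> Cluster:
    # Internal use: create an empty Cluster without showing the warning,
    # then attach the config and the retrieved resource afterwards.
    with warnings.catch_warnings():
        warnings.simplefilter("ignore")
        cluster = Cluster(None)
    cluster.config = {"name": cluster_name, "namespace": namespace}
    cluster.resource_yaml = {"metadata": dict(cluster.config)}  # stand-in for the retrieved yaml
    return cluster
```

The trade-off is that `None` becomes a legal constructor argument, which is the part described as crude; a dedicated class method avoids widening the public signature.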
Do you want to spend some time pair programming this tomorrow? We can spitball some ideas
Yeah, sounds good. I can set up a meeting for later.
dab3020 to
ee5ddc7
Compare
Signed-off-by: Bobbins228 <[email protected]>
Signed-off-by: Bobbins228 <[email protected]>
…ction doc Signed-off-by: Bobbins228 <[email protected]>
ee5ddc7 to
2ef248c
Compare
PR needs rebase. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Closing in favour of #751
Issue link
Closes: RHOAIENG-10385 and RHOAIENG-8846
What changes have been made
- `ray_version` a variable for potential future automation
- `image` config variable to default to `quay.io/rhoai/ray:2.23.0-py39-cu121`
- `create_resource`
- `get_cluster` method to generate a new ClusterConfiguration with just the `name` and `namespace` of the cluster and retrieved yaml. Added `is_appwrapper` bool so that users can get AppWrappers/Ray Clusters
- `_retrieved_cluster` boolean for `get_cluster` command to avoid generating a "false" Ray Cluster via `ClusterConfiguration`
- `Cluster Configuration` documentation and added new doc for methods used when interacting with Ray Clusters/AppWrappers.
Verification steps
Setup
Notebook server ODH/RHOAI/Local
- `git clone https://github.com/project-codeflare/codeflare-sdk.git`
- `poetry build` - install if needed (`pip install poetry`)
- `pip install --force-reinstall dist/codeflare_sdk-0.0.0.dev0-py3-none-any.whl`
Testing
All `ClusterConfiguration` parameters must be tested with the new cluster creation method.
Keep a special eye out for the following as they were the most complex to implement:
Automated Notebook testing should cover the functionality changed but I still suggest all parameters should be human verified.
Test the new and improved `get_cluster()` function.
NOTE: You can compare the original & retrieved clusters by setting `write_to_file=True` on `ClusterConfiguration` and `get_cluster()`
- `cluster = get_cluster(cluster_name=<name>, namespace=<namespace>, is_appwrapper=False, write_to_file=True)`
- `cluster.methods`
- `cluster.down()` then `cluster.up()`
TODO
- ImagePullSecrets # Done
- V1RayCluster
Checks